Our main goal in carrying out this project is to examine how to evaluate the “success” of second-generation immigrants in the United States in the period 2001-2003 and to analyze the factors that contribute to that success during their lifelong adaptation process. We also aim to build a model that can predict the success index of second-generation immigrants based on family background, educational attainment, and adaptation process.
The dataset we are using is the Children of Immigrants Longitudinal Study (CILS), which can be found via this link: https://toolbox.google.com/datasetsearch/search?query=first%20generation%20students&docid=8QcOTzZuWPdoGrB4AAAAAA%3D%3D.
The dataset contains 665 variables and 5262 observations, representing the questions asked and the responses collected over the whole longitudinal study. The variables can be split into 3 groups, corresponding to the 3 surveys that were conducted in total.
The dataset was built from 3 surveys targeting the immigrant second generation in the United States, who are, by the design of the survey, either US-born children of at least one foreign-born parent or foreign-born children who were later naturalized. During the data cleaning process, we created 2 separate subsets of the data: one, named “predictors”, contains all variables from the first 2 surveys, and the other, named “results”, contains variables from the last survey. The first and second surveys were conducted in the attendees’ schools, while the last survey was done through individual contact, leading to a noticeable decline in participation.
The first survey started in 1992 and collected 5262 responses from children in the 8th and 9th grades, representing 77 (original) nationalities. This survey aimed to collect baseline information on families, demographic characteristics, language use, self-identities, and academic attainment of the attendees.
The second survey was conducted 3 years later (1995), when the attendees finished high school, and retrieved 4288 responses (81.5% of the first survey). The goal of this follow-up was to examine the evolution of key adaptation outcomes, including language knowledge and preference, ethnic identity, self-esteem, and academic attainment, over the adolescent years. One important finding of this survey is the percentage of attendees who failed to graduate from high school. The survey also included interviews on parents’ outlooks on their children’s future; however, for the scope of this project, we decided not to take them into account.
In our project, we ran PCA on the “predictors” to find derived combinations of variables that are potential predictors of a person’s success. After analyzing the data, we decided that 43 variables are worth further consideration (listed below together with the caseId identifier):
names(predictors) #list of variables
## [1] "caseId"
## [2] "desired.job.prestige.score-1991"
## [3] "GPA"
## [4] "Parent.SES.index-1991"
## [5] "English.Knowledge-1991"
## [6] "Private.school-1991"
## [7] "Houshold.guardians-1991"
## [8] "number.household.members-1991"
## [9] "Sex"
## [10] "Respondent.US.stay.length-1991"
## [11] "Respondent job preference-1991"
## [12] "felt discriminated-1991"
## [13] "Depression-1991"
## [14] "Self-esteem-1991"
## [15] "education expectation-1991"
## [16] "Hours/day on HW-1991"
## [17] "Good grades importance-1991"
## [18] "Reason Dad came to US"
## [19] "Reason Mom came to US"
## [20] "Present living situation-1995"
## [21] "number people living w/respondent-1995"
## [22] "Economic situation/3 year ago-1995"
## [23] "Parent divorced/separated past year-1995"
## [24] "Parent re/married past year-1995"
## [25] "Parent lost job/past year-1995"
## [26] "Respondent ill/disabled past year-1995"
## [27] "Parent died past year-1995"
## [28] "Respondent sex-1995"
## [29] "Respodent US Citizenship-1995"
## [30] "Respondent job classification-1995"
## [31] "Don't feel save at school-1995"
## [32] "attainable education level-1995"
## [33] "Paren education preference-1995"
## [34] "Respondent hour studying-1995"
## [35] "Good grade importance-1995"
## [36] "English Knowledge-1995"
## [37] "Depression-1995"
## [38] "Self-esteem-1995"
## [39] "Familism index-1995"
## [40] "Family cohesion-1995"
## [41] "GPA-1995"
## [42] "Dropped out by 1995"
## [43] "Percent daily school attendance-1995"
## [44] "Private school-1995"
The final survey was conducted in 2001-2003 via mail, with 3613 respondents (68.9% of the first survey). The questionnaires focused mainly on the outcomes of the adaptation process, measured by educational attainment, employment and occupational status, income, civil status and ethnicity of spouses/partners, political attitudes and participation, ethnic and racial identities, delinquency and incarceration, attitudes and levels of identification with American society, and plans for the future. The 14 variables below (listed together with caseId) are used in our project as the components of the “success index”, a numeric value that we developed specifically for this project to represent a person’s success at the age of 24-26.
names(result) #components of success index
## [1] "caseId" "Residence Own house/aprt"
## [3] "Disabled or Ill" "Highest education completed"
## [5] "Present work situation" "Current job prestige scores"
## [7] "Current occupation satisfaction" "Present income satisfaction"
## [9] "Respodent identity importance" "Respodent detention/jail/prison"
## [11] "Country feels like home" "Respondent health"
## [13] "Has Children" "Average English Skill"
## [15] "MariageStatus"
After manipulating the source dataset, we arrived at a clean dataset to be used for analysis and model building. We decided to remove all responses with a missing value in any of the above-mentioned variables, which greatly reduced the size of the dataset to 656 observations and 59 variables. A summary of the clean dataset is shown below:
describe(Final[2:55])
## Description of Final[2:55]
##
## Numeric
## mean median var sd valid.n
## Residence Own house/aprt 0.24 0.00 0.18 0.43 645
## Disabled or Ill 0.93 1.00 0.06 0.25 645
## Highest education completed 4.41 5.00 2.75 1.66 645
## Present work situation 7.63 8.00 0.71 0.84 645
## Current job prestige scores 0.06 0.00 0.94 0.97 645
## Current occupation satisfaction 3.77 4.00 1.14 1.07 645
## Present income satisfaction 3.16 3.00 1.25 1.12 645
## Respodent identity importance 2.50 3.00 0.43 0.66 645
## Respodent detention/jail/prison 0.97 1.00 0.03 0.16 645
## Country feels like home 0.99 1.00 0.01 0.12 645
## Respondent health 4.23 4.00 0.66 0.81 645
## Has Children 0.14 0.00 0.12 0.35 645
## Average English Skill 3.86 4.00 0.14 0.37 645
## MariageStatus 0.29 0.00 0.21 0.45 645
## successIndex 3.73 3.73 0.50 0.71 645
## desired.job.prestige.score-1991 63.49 64.00 148.04 12.17 645
## GPA 2.85 2.83 0.67 0.82 645
## Parent.SES.index-1991 -0.01 0.04 0.51 0.72 645
## English.Knowledge-1991 3.78 4.00 0.17 0.41 645
## Private.school-1991 0.00 0.00 0.00 0.00 645
## Houshold.guardians-1991 1.64 1.00 1.91 1.38 645
## number.household.members-1991 4.43 4.00 3.38 1.84 645
## Sex 1.59 2.00 0.24 0.49 645
## Respondent.US.stay.length-1991 1.89 2.00 0.87 0.93 645
## Respondent job preference-1991 8.67 9.00 7.87 2.81 645
## felt discriminated-1991 1.41 1.00 0.24 0.49 645
## Depression-1991 1.62 1.50 0.37 0.61 645
## Self-esteem-1991 3.34 3.40 0.27 0.52 645
## education expectation-1991 4.32 5.00 0.73 0.85 645
## Hours/day on HW-1991 2.64 2.00 1.77 1.33 645
## Good grades importance-1991 1.28 1.00 0.43 0.66 645
## Reason Dad came to US 2.93 2.00 7.64 2.76 645
## Reason Mom came to US 2.39 2.00 2.87 1.69 645
## Present living situation-1995 1.86 1.00 3.16 1.78 645
## number people living w/respondent-1995 4.12 4.00 3.18 1.78 645
## Economic situation/3 year ago-1995 2.59 3.00 0.94 0.97 645
## Parent divorced/separated past year-1995 1.92 2.00 0.07 0.27 645
## Parent re/married past year-1995 1.93 2.00 0.06 0.25 645
## Parent lost job/past year-1995 1.73 2.00 0.20 0.44 645
## Respondent ill/disabled past year-1995 1.92 2.00 0.08 0.27 645
## Parent died past year-1995 1.98 2.00 0.02 0.12 645
## Respondent sex-1995 1.59 2.00 0.24 0.49 645
## Respodent US Citizenship-1995 1.35 1.00 0.23 0.48 645
## Respondent job classification-1995 7.74 8.00 6.18 2.49 645
## Don't feel save at school-1995 3.09 3.00 0.81 0.90 645
## attainable education level-1995 4.38 5.00 0.58 0.76 645
## Paren education preference-1995 4.62 5.00 0.49 0.70 645
## Respondent hour studying-1995 2.97 3.00 2.38 1.54 645
## Good grade importance-1995 1.36 1.00 0.42 0.65 645
## English Knowledge-1995 3.82 4.00 0.13 0.36 645
## Depression-1995 1.67 1.50 0.38 0.62 645
## Self-esteem-1995 3.42 3.50 0.29 0.53 645
## Familism index-1995 1.82 1.67 0.32 0.57 645
## Family cohesion-1995 3.66 4.00 1.02 1.01 645
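The removal of responses with missing values described earlier is a form of listwise deletion. A minimal sketch of that step in base R follows; the data frame `merged` and its values are hypothetical stand-ins, not the CILS data, and the actual cleaning code may differ.

```r
# Toy data frame standing in for the merged survey data (values are made up).
merged <- data.frame(
  caseId = 1:6,
  GPA = c(3.1, NA, 2.4, 3.8, 2.9, NA),
  successIndex = c(3.5, 4.0, NA, 4.2, 3.1, 3.9)
)

# Flag rows that have no missing value in any column.
keep <- complete.cases(merged)

# Keep only the complete rows.
clean <- merged[keep, ]

# Report how many observations survive the filter.
cat("kept", nrow(clean), "of", nrow(merged), "rows\n")
```

Because a row is dropped as soon as any single variable is missing, this filter shrinks quickly as more variables are included, which is consistent with the large reduction in sample size reported above.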
To understand our dataset, we performed exploratory data analysis on the ‘predictors’ subset. A general report of this subset is attached to this report. In particular, we used histograms to see the distribution of values of each variable and examined the correlations between variables:
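library(dplyr)        # select()
library(DataExplorer) # plot_histogram(), plot_correlation()
# Drop the two private-school indicators before plotting
predictors = select(predictors, -`Private school-1995`, -`Private.school-1991`)
plot_histogram(predictors)
plot_correlation(predictors)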
Looking at the correlation plot, we noticed that most variables do not correlate strongly with the other variables in the dataset, so multicollinearity is unlikely to be a problem and each variable may carry distinct information for our model. However, the remaining 41 variables (excluding caseId) are still a fairly large number to fit in the model. We proceeded by performing PCA on our data:
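library(factoextra)  # fviz_pca_biplot(), fviz_eig(), fviz_contrib()
predictors = na.omit(predictors)
# Column 1 is caseId; columns 2-42 are the variables used in the PCA
P_predict <- prcomp(predictors[2:42], scale = TRUE)
fviz_pca_biplot(P_predict,
                repel = FALSE,     # label repulsion is off; repel = TRUE avoids text overlap but is slow on large datasets
                col.var = "red",   # Variables color
                col.ind = "black") + theme_minimal()
fviz_eig(P_predict, addlabels = TRUE)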
From the scree plot above, we can see that the first five principal components contain 33.9% of the variability in the data (with the first really standing out) before the leveling-off point. This suggests that representing the data in five derived dimensions is a reasonable summary of the variables. The next step in our analysis is to investigate which variables contribute heavily to each of these five components. Looking at the top 6 contributing variables for each dimension in the contribution plots below, we note the main contributors to each dimension:
fviz_contrib(P_predict, choice = "var", axes = 1, top = 6)
PC1: respondent’s education in both secondary school and high school (on a 4-point scale)
fviz_contrib(P_predict, choice = "var", axes = 2, top = 6)
PC2: respondent’s English capability during both secondary school and high school
fviz_contrib(P_predict, choice = "var", axes = 3, top = 6)
PC3: respondent’s gender and their level of depression during both secondary school and high school
fviz_contrib(P_predict, choice = "var", axes = 4, top = 6)
PC4: whether the respondent was a US citizen in high school and the reason their father came to the US
fviz_contrib(P_predict, choice = "var", axes = 5, top = 6)
PC5: respondent’s gender during both secondary school and high school, their present living situation in 1995, and their guardian(s) when they were in secondary school.
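One practical point about the PCA above: `prcomp` with scaling stops with an error if a column is constant, because a zero-variance column cannot be rescaled to unit variance. The summary table reports zero variance for Private.school-1991 in the cleaned data, which may be one reason for dropping it, although that is our assumption here. The sketch below uses made-up data (`toy` and its column names are hypothetical) to show a constant-column check followed by a scaled PCA and the cumulative share of variance.

```r
set.seed(1)

# Toy predictors (made-up values); `flag` is constant, i.e. it has zero variance.
toy <- data.frame(
  x1 = rnorm(50),
  x2 = rnorm(50),
  x3 = rnorm(50),
  flag = rep(0, 50)
)

# Identify columns whose variance is zero.
is_constant <- vapply(toy, function(col) var(col) == 0, logical(1))

# Keep only the columns that can be standardized.
usable <- toy[, !is_constant, drop = FALSE]

# PCA on the standardized remaining columns.
pca <- prcomp(usable, scale. = TRUE)

# Cumulative share of variance explained by the leading components.
cum_share <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
print(round(cum_share, 3))
```

The same `cum_share` calculation applied to `P_predict$sdev` would give the cumulative percentages that the scree plot displays per component.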
To understand our dataset, we also performed exploratory data analysis on the ‘result’ subset. A general report of this subset is attached to this report. In particular, we used histograms to see the distribution of values of each variable and examined the correlations between variables:
plot_histogram(result)
plot_correlation(result)
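A correlation plot can be complemented with a numeric screen that lists only the strongly correlated pairs. The following is a minimal sketch on made-up data; the data frame `toy`, its columns, and the 0.7 cutoff are all hypothetical choices and are not taken from our analysis.

```r
set.seed(2)

# Toy outcome variables (made-up values); `b` is built to track `a` closely.
a <- rnorm(100)
toy <- data.frame(
  a = a,
  b = a + rnorm(100, sd = 0.3),
  c = rnorm(100)
)

# Pairwise Pearson correlations.
r <- cor(toy)

# Hypothetical threshold for calling a pair "strongly correlated".
cutoff <- 0.7

# Look only at the upper triangle so each pair is listed once.
idx <- which(abs(r) > cutoff & upper.tri(r), arr.ind = TRUE)

strong_pairs <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r = r[idx]
)
print(strong_pairs)
```

Applied to the outcome variables, a table like `strong_pairs` would make it easy to see which pairs are candidates for keeping only one member.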
Looking at the correlation plot, we noticed that most variables do not correlate strongly with the other variables in the dataset, except for “Present income satisfaction” and “Current occupation satisfaction”. This correlation is understandable, so we might consider choosing only one of the two when calculating our success index. In general, multicollinearity is unlikely to be a problem, and these variables can serve as components in our success formula. We proceeded by performing PCA on our data:
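# Column 1 is caseId; columns 2-15 are the 14 outcome variables
P_result <- prcomp(result[2:15], scale = TRUE)
fviz_pca_biplot(P_result,
                repel = FALSE,        # label repulsion is off; repel = TRUE avoids text overlap but is slow on large datasets
                col.var = "red",      # Variables color
                col.ind = "contrib",  # color individuals by their contribution
                geom = 'point') + theme_minimal()
fviz_eig(P_result, addlabels = TRUE)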
P_result
## Standard deviations (1, .., p=14):
## [1] 1.3806283 1.2988364 1.1182526 1.0616080 1.0400941 1.0147767 1.0023476
## [8] 0.9553649 0.9308578 0.9054144 0.8592021 0.7802670 0.7207285 0.6690536
##
## Rotation (n x k) = (14 x 14):
## PC1 PC2 PC3
## Residence Own house/aprt 0.011324027 -0.150881948 0.130084151
## Disabled or Ill 0.197977987 -0.004720986 0.435527802
## Highest education completed 0.527303288 0.127574324 0.096174817
## Present work situation -0.133109839 -0.072664109 -0.153639326
## Current job prestige scores 0.465852144 -0.109110399 0.296961314
## Current occupation satisfaction 0.156237203 -0.548895190 -0.348097455
## Present income satisfaction 0.074491646 -0.607176288 -0.249495560
## Respodent identity importance 0.004620463 -0.154556536 0.131299776
## Respodent detention/jail/prison 0.303133954 -0.013749608 0.175342991
## Country feels like home 0.072954176 -0.139241812 0.218968294
## Respondent health 0.248825420 -0.184273782 0.008583615
## Has Children -0.404294590 -0.254503929 0.408137846
## Average English Skill 0.163580287 -0.137140929 -0.042888249
## MariageStatus -0.261861868 -0.335858971 0.473826576
## PC4 PC5 PC6
## Residence Own house/aprt -0.458027940 0.16955035 -0.383918125
## Disabled or Ill 0.339423519 -0.07975109 -0.099754657
## Highest education completed -0.005844227 -0.04889315 0.193256540
## Present work situation 0.167267786 -0.52166064 -0.570656465
## Current job prestige scores 0.047318828 -0.05840879 0.135379310
## Current occupation satisfaction 0.183629316 0.08623116 0.126885723
## Present income satisfaction 0.198987885 0.07739785 0.006679392
## Respodent identity importance -0.254084353 0.63716388 -0.166743868
## Respodent detention/jail/prison 0.290068374 0.10921361 -0.267342309
## Country feels like home -0.062666763 0.01091477 -0.487467572
## Respondent health -0.316912494 -0.30516907 -0.013149360
## Has Children 0.037164220 -0.06794128 0.248372522
## Average English Skill -0.562990345 -0.35320140 0.149692161
## MariageStatus -0.005996583 -0.18169835 0.154974873
## PC7 PC8 PC9
## Residence Own house/aprt 0.42384567 0.49697450 -0.15895170
## Disabled or Ill 0.33933096 -0.25714645 0.46096035
## Highest education completed 0.04061613 0.17187536 -0.01088420
## Present work situation 0.30507493 -0.17922363 -0.01355746
## Current job prestige scores 0.13891275 0.20442603 0.07073380
## Current occupation satisfaction 0.03603939 0.01567142 0.08001240
## Present income satisfaction 0.03211578 0.11023686 0.02650730
## Respodent identity importance 0.11390377 -0.55203582 0.10208499
## Respodent detention/jail/prison -0.10650649 -0.17473548 -0.73525622
## Country feels like home -0.64185363 0.24391431 0.36453580
## Respondent health -0.35733832 -0.35611721 -0.08600594
## Has Children -0.01394806 0.01302529 -0.04436278
## Average English Skill 0.15062313 -0.22014953 0.02725211
## MariageStatus -0.04467244 0.02858085 -0.23626806
## PC10 PC11 PC12
## Residence Own house/aprt 0.262909047 -0.203425526 -0.00612239
## Disabled or Ill 0.151297219 -0.455096748 0.11637195
## Highest education completed -0.037445186 0.300320830 0.42610228
## Present work situation -0.074071230 0.412574467 -0.02312293
## Current job prestige scores -0.061735534 0.406785772 -0.51766632
## Current occupation satisfaction -0.016241625 -0.124744142 0.07304897
## Present income satisfaction -0.004063961 -0.002602621 0.01126027
## Respodent identity importance -0.079488374 0.348064011 0.04239875
## Respodent detention/jail/prison -0.224335766 -0.265335794 -0.07512013
## Country feels like home -0.271506568 -0.029474597 0.05089927
## Respondent health 0.644414239 -0.037642786 -0.15360769
## Has Children -0.118789327 -0.019695715 -0.43238919
## Average English Skill -0.576996381 -0.269667539 0.03664367
## MariageStatus 0.065890210 0.207620569 0.55583359
## PC13 PC14
## Residence Own house/aprt 0.086601700 0.065723503
## Disabled or Ill -0.002168778 -0.047677422
## Highest education completed 0.592957768 -0.024239801
## Present work situation 0.140565522 0.079160088
## Current job prestige scores -0.390601256 0.054669361
## Current occupation satisfaction 0.035192436 0.682298085
## Present income satisfaction 0.041357563 -0.708633853
## Respodent identity importance 0.054926426 0.009519455
## Respodent detention/jail/prison 0.023236396 0.015684115
## Country feels like home 0.032162911 0.055722783
## Respondent health 0.088546856 -0.017061294
## Has Children 0.574089554 0.064348346
## Average English Skill -0.067558128 -0.070917362
## MariageStatus -0.345028705 0.054491593
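The standard deviations printed at the top of this output can be converted into the share of variance carried by each component, which is what a scree plot displays. The sketch below copies the 14 printed values by hand; the variable names (`sdev`, `share`, `cumulative`) are ours.

```r
# Standard deviations of the 14 components, copied from the prcomp output above.
sdev <- c(1.3806283, 1.2988364, 1.1182526, 1.0616080, 1.0400941,
          1.0147767, 1.0023476, 0.9553649, 0.9308578, 0.9054144,
          0.8592021, 0.7802670, 0.7207285, 0.6690536)

# Each component's variance is its squared standard deviation.
variance <- sdev^2

# Share of total variance per component, in percent.
share <- 100 * variance / sum(variance)

# Running total of the shares, in percent.
cumulative <- cumsum(share)

print(round(share[1:3], 1))
print(round(cumulative[1:3], 1))
```

For the values printed above, the first three components come to about 34.6% of the total variance, and adding the fourth brings the total to roughly 42.6%.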
From the scree plot above, we can see that the first three principal components contain 34.6% of the variability in the data (the sum of the squared standard deviations of PC1 to PC3 divided by the total, as printed above) before the leveling-off point. This suggests that representing the data in three derived dimensions is a reasonable summary of the variables. The next step in our analysis is to investigate which variables contribute heavily to each of these three components. Looking at the top 6 contributing variables for each dimension in the contribution plots below, we note the main contributors to each dimension:
fviz_contrib(P_result, choice = "var", axes = 1, top = 6)
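PC1: respondent’s highest completed education and the prestige of their current job, contrasted with having children (the largest loadings in the PC1 column of the rotation matrix)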
fviz_contrib(P_result, choice = "var", axes = 2, top = 6)
PC2: respondent’s satisfaction with their income and with their current occupation (the largest loadings in the PC2 column of the rotation matrix)
fviz_contrib(P_result, choice = "var", axes = 3, top = 6)
PC3: respondent’s marriage status, disability or illness, and whether they have children (the largest loadings in the PC3 column of the rotation matrix)
fviz_contrib(P_result, choice = "var", axes = 4, top = 6)
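The contribution plots can be cross-checked against the rotation matrix printed earlier. For a `prcomp` fit, each column of the rotation matrix has unit length, so the percent contribution of a variable to a single component is 100 times its squared loading. The sketch below does this for PC1 using the loadings copied by hand from the output; the names `pc1`, `contrib`, and `top6` are ours.

```r
# PC1 loadings of the 14 outcome variables, copied from the rotation matrix.
pc1 <- c(
  "Residence Own house/aprt"        =  0.011324027,
  "Disabled or Ill"                 =  0.197977987,
  "Highest education completed"     =  0.527303288,
  "Present work situation"          = -0.133109839,
  "Current job prestige scores"     =  0.465852144,
  "Current occupation satisfaction" =  0.156237203,
  "Present income satisfaction"     =  0.074491646,
  "Respodent identity importance"   =  0.004620463,
  "Respodent detention/jail/prison" =  0.303133954,
  "Country feels like home"         =  0.072954176,
  "Respondent health"               =  0.248825420,
  "Has Children"                    = -0.404294590,
  "Average English Skill"           =  0.163580287,
  "MariageStatus"                   = -0.261861868
)

# Percent contribution of each variable to PC1 (squared loading times 100).
contrib <- 100 * pc1^2

# The six largest contributions, mirroring `top = 6` in the plot calls.
top6 <- sort(contrib, decreasing = TRUE)[1:6]
print(round(top6, 1))

# Contribution each variable would have if all contributed equally.
uniform <- 100 / length(pc1)
print(round(uniform, 1))
```

Variables whose contribution exceeds `uniform` are the ones that contribute more than they would under equal weighting, which is a common way to decide which variables characterize a component.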
Initially we planned to use a linear regression model; however, because our data involve a large number of explanatory variables, we need alternatives that can implicitly select variables with strong effects, improve accuracy or flexibility, and avoid overfitting. The four models we are considering are basic linear regression, LASSO (least absolute shrinkage and selection operator), GAM (Generalized Additive Model), and PCR (Principal Components Regression). We proceed by fitting and evaluating these models using out-of-sample cross-validation to compare the RMSE of each model.
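The comparison described above can be sketched as a k-fold cross-validation loop that reports the out-of-sample RMSE per model. The example below is only an illustration under stated assumptions: it uses simulated data rather than the CILS data, covers just two of the four models (linear regression and a hand-rolled PCR built from `prcomp` and `lm`), and all helper names (`fit_lm`, `fit_pcr`, `cv_rmse`, and so on) are hypothetical.

```r
set.seed(42)

# Simulated data (not the CILS data): 200 rows, 10 predictors, 3 of which matter.
n <- 200
p <- 10
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- 1.5 * X[, 1] - 1.0 * X[, 2] + 0.5 * X[, 3] + rnorm(n)
dat <- data.frame(y = y, X)

# Root mean squared error between observed and predicted values.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Ordinary least squares on all predictors.
fit_lm <- function(train) lm(y ~ ., data = train)
pred_lm <- function(fit, test) predict(fit, newdata = test)

# PCR: PCA on the training predictors, then regression on the first k scores.
fit_pcr <- function(train, k = 3) {
  pca <- prcomp(train[, -1], scale. = TRUE)
  scores <- data.frame(y = train$y, pca$x[, 1:k, drop = FALSE])
  list(pca = pca, k = k, model = lm(y ~ ., data = scores))
}
pred_pcr <- function(fit, test) {
  # Project test rows using the training centering, scaling, and rotation.
  proj <- predict(fit$pca, newdata = test[, -1])[, 1:fit$k, drop = FALSE]
  predict(fit$model, newdata = as.data.frame(proj))
}

# Assign every row to one of 5 folds; the same folds are reused for each model.
folds <- 5
fold_id <- sample(rep(seq_len(folds), length.out = nrow(dat)))

# Average held-out RMSE over the folds for one fit/predict pair.
cv_rmse <- function(dat, fit_fun, pred_fun, fold_id) {
  errs <- vapply(sort(unique(fold_id)), function(f) {
    train <- dat[fold_id != f, ]
    test <- dat[fold_id == f, ]
    rmse(test$y, pred_fun(fit_fun(train), test))
  }, numeric(1))
  mean(errs)
}

results <- c(
  lm = cv_rmse(dat, fit_lm, pred_lm, fold_id),
  pcr = cv_rmse(dat, fit_pcr, pred_pcr, fold_id)
)
print(round(results, 3))
```

Reusing the same `fold_id` for every model keeps the comparison fair, since each model is scored on identical held-out rows. LASSO and GAM could be added as further fit/predict pairs, for example with `glmnet::cv.glmnet` and `mgcv::gam`, and `pls::pcr` offers a packaged alternative to the hand-rolled PCR shown here.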